Make checkpoint tests fail on missing required binding symbols #2150
rwgk wants to merge 2 commits into
Ensure checkpoint tests distinguish missing required cuda.bindings symbols from genuinely unsupported environments.
/ok to test
PR 2150 first CI failure analysis
Workflow: https://github.com/NVIDIA/cuda-python/actions/runs/26591678170
Commit: 293258d
Workflow result: failed.
High-level result
The build and non-test infrastructure mostly passed:
The failures are concentrated in test matrix jobs. There were 37 failed test jobs plus the final status aggregation job. Failure counts by CUDA version:
Failure counts by platform:
Failure mode 1: CUDA 13.3 missing
I looked into the CUDA 12.9 failures from the first PR #2150 CI run. The short version: these failures look separate from the CUDA 13.3
In
grep -r -i GpuPair /usr/local/cuda-12.9
returns no matches. The CUDA 12.9
typedef struct CUcheckpointRestoreArgs_st {
    cuuint64_t reserved[8]; /**< Reserved for future use, must be zeroed */
} CUcheckpointRestoreArgs;
That matches the CUDA 12.9 CI failure mode from https://github.com/NVIDIA/cuda-python/actions/runs/26591678170: Linux CUDA 12.9 jobs now fail during
So my current interpretation is:
Possible follow-up direction: keep missing required symbols as failures for APIs that should exist in the active CUDA version, but treat the CUDA 12.9/no-
Keep baseline CUDA checkpoint coverage active for CUDA versions whose headers do not expose GPU remapping structs, while still failing when required base checkpoint bindings such as CUcheckpointRestoreArgs are missing. Gate only the GPU migration path on CUcheckpointGpuPair so CUDA 12.9 can exercise state, lock, checkpoint, restore-without-mapping, and unlock.
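The split described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the PR: the `SimpleNamespace` objects are hypothetical stand-ins for a driver bindings module, the helper names are invented for the example, and `restore-with-gpu-mapping` is a made-up label for the GPU migration step. Only `CUcheckpointRestoreArgs` and `CUcheckpointGpuPair` come from the discussion above.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for a driver bindings module (not the real cuda.bindings surface).
driver_12_9 = SimpleNamespace(CUcheckpointRestoreArgs=object)
driver_with_remap = SimpleNamespace(
    CUcheckpointRestoreArgs=object, CUcheckpointGpuPair=object
)
driver_broken = SimpleNamespace()  # required base symbol dropped


def require_baseline_checkpoint(driver):
    """Fail loudly when a required base checkpoint binding is missing."""
    if not hasattr(driver, "CUcheckpointRestoreArgs"):
        raise AttributeError("required binding CUcheckpointRestoreArgs is missing")


def supports_gpu_remapping(driver):
    """GPU migration is optional: it needs CUcheckpointGpuPair in the bindings."""
    return hasattr(driver, "CUcheckpointGpuPair")


def plan_checkpoint_steps(driver):
    """Return the checkpoint steps that can be exercised against this driver."""
    require_baseline_checkpoint(driver)
    steps = ["state", "lock", "checkpoint", "restore-without-mapping", "unlock"]
    if supports_gpu_remapping(driver):
        # Hypothetical label for the GPU migration path.
        steps.append("restore-with-gpu-mapping")
    return steps
```

With this shape, a CUDA 12.9-like namespace still runs the five baseline steps, while a namespace without `CUcheckpointRestoreArgs` raises instead of being skipped.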
/ok to test
PR 2150 second CI failure analysis
Workflow: https://github.com/NVIDIA/cuda-python/actions/runs/26596635176
Commit: cd730c1
Current workflow state at inspection time:
High-level result
The second CI run matches expectations after splitting baseline checkpoint support from GPU-remapping support. All completed failures are CUDA 13.3.0 test jobs. CUDA 12.9.1 and CUDA 13.0.2 jobs that completed are passing. Failure counts by CUDA version:
Failure counts by platform:
Remaining failure mode: CUDA 13.3 missing
leofang
left a comment
Do we really need this PR as-is? It seems pretty AI-slop to me... If a binding is broken, we just fix it and move on. Same if it's the codegen that's broken. I don't think adding tests to either cuda-bindings or cuda-core like this is maintainable.
Thanks Leo, fair concern on maintainability. This was AI-assisted, but I manually guided and reviewed it.
The intent is not to build a broad ad hoc API coverage framework in this PR. The new cuda_bindings test is a deliberately narrow regression guard for the exact checkpoint binding surface that silently disappeared in 13.3.0. Keith also raised the broader version of this in Slack: we should have something that enumerates all public APIs and types per CUDA version so parser failures cannot silently drop symbols. I agree with that, but it is a much larger ask.
For cuda_core, the goal is also narrower than adding new checkpoint scenario coverage. The existing checkpoint tests already covered the restore path, but the availability helper treated missing required bindings as an unsupported-environment skip. This PR fixes that boundary: unsupported drivers/old bindings still skip, but missing required symbols now fail. The added cuda_core tests are focused on that skip/fail behavior and on separating baseline checkpoint support from GPU remapping support.
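For readers who want a picture of what a narrow regression guard of this kind looks like, here is a minimal sketch. It is not the test added in this PR: the tuple of required names is an assumption (only `CUcheckpointRestoreArgs` is named in this thread), the function names are invented, and the module is passed in as a parameter so the example stays independent of any real bindings package.

```python
from types import SimpleNamespace

# Assumed list for illustration; the PR's actual list of required symbols may be longer.
REQUIRED_CHECKPOINT_SYMBOLS = ("CUcheckpointRestoreArgs",)


def missing_symbols(module, required=REQUIRED_CHECKPOINT_SYMBOLS):
    """Return the required names that the given module does not expose."""
    return [name for name in required if not hasattr(module, name)]


def check_checkpoint_bindings_complete(module):
    """Raise AssertionError listing every required symbol that is absent."""
    missing = missing_symbols(module)
    assert not missing, f"missing required checkpoint bindings: {missing}"


# Hypothetical stand-ins for a complete and an incomplete bindings module.
complete = SimpleNamespace(CUcheckpointRestoreArgs=object)
incomplete = SimpleNamespace()
```

Calling `check_checkpoint_bindings_complete(incomplete)` fails with the missing name in the message, which is the behavior a silently dropped symbol should trigger.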
Closes #2149
Summary
- Fix the cuda.core checkpoint test availability guard so it still skips true unsupported environments, but no longer skips missing required cuda.bindings symbols.
- Add a cuda.bindings completeness test for the checkpoint symbols required by cuda.core.checkpoint, including CUcheckpointRestoreArgs.
Context
This is a follow-up to #2144 and fixes the test coverage gap tracked in #2149.
The CUDA 13.3.0 CUcheckpointRestoreArgs generation issue fixed by #2144 could pass the existing test flow because the cuda.core checkpoint tests treated all RuntimeErrors from checkpoint._get_driver() as an unsupported environment. That included:
This PR keeps the intended skips for genuinely unsupported configurations, but lets missing required binding attributes propagate as test failures.
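The skip/fail boundary can be illustrated with a small sketch. It assumes one possible way to draw the line, which is not necessarily how this PR does it: a dedicated exception subclass marks genuinely unsupported environments, and only that subclass is converted into a skip. `UnsupportedEnvironment`, `require_checkpoint_support`, and the fake driver getters are hypothetical names, and `unittest.SkipTest` from the standard library stands in for whatever skip mechanism the real tests use.

```python
import unittest


class UnsupportedEnvironment(RuntimeError):
    """Hypothetical marker: the driver or bindings are genuinely too old."""


def require_checkpoint_support(get_driver):
    """Return the driver, skipping only for genuinely unsupported environments."""
    try:
        return get_driver()
    except UnsupportedEnvironment as exc:
        # Intended skip: nothing is broken, the environment just lacks the feature.
        raise unittest.SkipTest(str(exc)) from exc
    # Any other error, including a plain RuntimeError or AttributeError caused
    # by a missing required binding, propagates and fails the test.


def old_driver():
    raise UnsupportedEnvironment("driver too old for checkpoint APIs")


def broken_bindings():
    raise RuntimeError("required binding CUcheckpointRestoreArgs is missing")
```

The contrast with a catch-all `except RuntimeError` is the point: under the catch-all, `broken_bindings` would be reported as a skip, whereas here it surfaces as a failure.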
Validation
On the pre-#2144 base, these focused tests now expose the breakage:
fails during collection with:
and:
fails with:
After PR #2144 lands and this branch is rebased onto it, the focused checkpoint tests should pass and demonstrate that the original generation issue is fixed while the error-masking skip is closed.
Related